382 research outputs found

    Objective Classification of Galaxy Spectra using the Information Bottleneck Method

    A new method for classification of galaxy spectra is presented, based on a recently introduced information theoretical principle, the 'Information Bottleneck'. For any desired number of classes, galaxies are classified such that the information content about the spectra is maximally preserved. The result is classes of galaxies with similar spectra, where the similarity is determined via a measure of information. We apply our method to approximately 6000 galaxy spectra from the ongoing 2dF redshift survey, and a mock-2dF catalogue produced by a Cold Dark Matter-based semi-analytic model of galaxy formation. We find a good match between the mean spectra of the classes found in the data and in the models. For the mock catalogue, we find that the classes produced by our algorithm form an intuitively sensible sequence in terms of physical properties such as colour, star formation activity, morphology, and internal velocity dispersion. We also show the correlation of the classes with the projections resulting from a Principal Component Analysis. Comment: submitted to MNRAS, 17 pages, LaTeX, with 14 figures embedded
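
As a reading aid, the general Information Bottleneck trade-off can be written as below. This is the standard form of the principle rather than the exact formulation used in the paper; the symbols $X$ (galaxies), $Y$ (spectral bins), $T$ (classes), and the trade-off parameter $\beta$ are assumptions introduced here for illustration.

```latex
% Standard Information Bottleneck functional (notation assumed, not from the abstract):
% find a soft assignment p(t|x) of objects X to classes T that compresses X
% while preserving information about the relevant variable Y.
\min_{p(t \mid x)} \; \mathcal{L}\bigl[p(t \mid x)\bigr] = I(X;T) - \beta \, I(T;Y),
\qquad
I(T;Y) = \sum_{t,y} p(t,y) \log \frac{p(t,y)}{p(t)\,p(y)} .
```

For a fixed number of classes, preserving as much of $I(T;Y)$ as possible corresponds to the abstract's statement that the information content about the spectra is maximally preserved.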

    Information based clustering

    In an age of increasingly large data sets, investigators in many different disciplines have turned to clustering as a tool for data analysis and exploration. Existing clustering methods, however, typically depend on several nontrivial assumptions about the structure of data. Here we reformulate the clustering problem from an information theoretic perspective which avoids many of these assumptions. In particular, our formulation obviates the need for defining a cluster "prototype", does not require an a priori similarity metric, is invariant to changes in the representation of the data, and naturally captures non-linear relations. We apply this approach to different domains and find that it consistently produces clusters that are more coherent than those extracted by existing algorithms. Finally, our approach provides a way of clustering based on collective notions of similarity rather than the traditional pairwise measures. Comment: To appear in Proceedings of the National Academy of Sciences USA, 11 pages, 9 figures

    Propagation of charged particle waves in a uniform magnetic field

    This paper considers the probability density and current distributions generated by a point-like, isotropic source of monoenergetic charges embedded into a uniform magnetic field environment. Electron sources of this kind have been realized in recent photodetachment microscopy experiments. Unlike the total photocurrent cross section, which is largely understood, the spatial profiles of charge and current emitted by the source display an unexpected hierarchy of complex patterns, even though the distributions, apart from scaling, depend only on a single physical parameter. We examine the electron dynamics both by solving the quantum problem, i.e., finding the energy Green function, and from a semiclassical perspective based on the simple cyclotron orbits followed by the electron. Simulations suggest that the semiclassical method, which involves here interference between an infinite set of paths, faithfully reproduces the features observed in the quantum solution, even in extreme circumstances, and lends itself to an interpretation of some (though not all) of the rich structure exhibited in this simple problem. Comment: 39 pages, 16 figures

    Motif Discovery through Predictive Modeling of Gene Regulation

    We present MEDUSA, an integrative method for learning motif models of transcription factor binding sites by incorporating promoter sequence and gene expression data. We use a modern large-margin machine learning approach, based on boosting, to enable feature selection from the high-dimensional search space of candidate binding sequences while avoiding overfitting. At each iteration of the algorithm, MEDUSA builds a motif model whose presence in the promoter region of a gene, coupled with activity of a regulator in an experiment, is predictive of differential expression. In this way, we learn motifs that are functional and predictive of regulatory response rather than motifs that are simply overrepresented in promoter sequences. Moreover, MEDUSA produces a model of the transcriptional control logic that can predict the expression of any gene in the organism, given the sequence of the promoter region of the target gene and the expression state of a set of known or putative transcription factors and signaling molecules. Each motif model is either a k-length sequence, a dimer, or a PSSM that is built by agglomerative probabilistic clustering of sequences with similar boosting loss. By applying MEDUSA to a set of environmental stress response expression data in yeast, we learn motifs whose ability to predict differential expression of target genes outperforms motifs from the TRANSFAC dataset and from a previously published candidate set of PSSMs. We also show that MEDUSA retrieves many experimentally confirmed binding sites associated with environmental stress response from the literature. Comment: RECOMB 200

    Psoriasis prediction from genome-wide SNP profiles

    Background: With the availability of large-scale genome-wide association study (GWAS) data, choosing an optimal set of SNPs for disease susceptibility prediction is a challenging task. This study aimed to use single nucleotide polymorphisms (SNPs) to predict psoriasis by searching GWAS data. Methods: In total, we had 2,798 samples and 451,724 SNPs. The process of searching for a set of SNPs to predict susceptibility to psoriasis consisted of two steps. The first was to select the top 1,000 SNPs with high accuracy for predicting psoriasis from the GWAS dataset. The second was to search for an optimal SNP subset for predicting psoriasis. The sequential information bottleneck (sIB) method was compared with classical linear discriminant analysis (LDA) for classification performance. Results: The best test harmonic mean of sensitivity and specificity for predicting psoriasis by sIB was 0.674 (95% CI: 0.650-0.698), while only 0.520 (95% CI: 0.472-0.524) was reported for predicting disease by LDA. Our results indicate that the new classifier sIB performs better than LDA in the study. Conclusions: The fact that a small set of SNPs can predict disease status with average accuracy of 68% makes it possible to use SNP data for psoriasis prediction.
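
The performance measure named in the Results, the harmonic mean of sensitivity and specificity, can be sketched as follows. This is a minimal illustration under stated assumptions: the function names and the confusion-matrix counts are hypothetical and are not taken from the study.

```python
def sensitivity_specificity(tp: int, fn: int, tn: int, fp: int) -> tuple[float, float]:
    """Return (sensitivity, specificity) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # fraction of true cases correctly detected
    specificity = tn / (tn + fp)  # fraction of controls correctly rejected
    return sensitivity, specificity


def harmonic_mean(sensitivity: float, specificity: float) -> float:
    """Harmonic mean of two rates; defined as 0.0 when both rates are zero."""
    total = sensitivity + specificity
    if total == 0:
        return 0.0
    return 2 * sensitivity * specificity / total


# Made-up counts for illustration only (not data from the paper).
sens, spec = sensitivity_specificity(tp=80, fn=20, tn=60, fp=40)
score = harmonic_mean(sens, spec)
print(f"sensitivity={sens:.2f} specificity={spec:.2f} harmonic mean={score:.3f}")
```

The harmonic mean is pulled toward the smaller of the two rates, so a classifier cannot score well by favouring sensitivity at the expense of specificity, or the reverse.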

    Discrete profile comparison using information bottleneck

    Sequence homologs are an important source of information about proteins. Amino acid profiles, representing the position-specific mutation probabilities found in profiles, are a richer encoding of biological sequences than the individual sequences themselves. However, profile comparisons are an order of magnitude slower than sequence comparisons, making profiles impractical for large datasets. Also, because they are such a rich representation, profiles are difficult to visualize. To address these problems, we describe a method to map probabilistic profiles to a discrete alphabet while preserving most of the information in the profiles. We find an informationally optimal discretization using the Information Bottleneck (IB) approach. We observe that an 80-character IB alphabet captures nearly 90% of the amino acid occurrence information found in profiles, compared to the consensus sequence's 78%. Distant homolog search with IB sequences is 88% as sensitive as with profiles compared to 61% with consensus sequences (AUC scores 0.73, 0.83, and 0.51, respectively), but like simple sequence comparison, is 30 times faster. Discrete IB encoding can therefore expand the range of sequence problems to which profile information can be applied to include batch queries over large databases like SwissProt, which were previously computationally infeasible.

    Systems biology via redescription and ontologies (I): finding phase changes with applications to malaria temporal data

    Biological systems are complex and often composed of many subtly interacting components. Furthermore, such systems evolve through time and, as the underlying biology executes its genetic program, the relationships between components change and undergo dynamic reorganization. Characterizing these relationships precisely is a challenging task, but one that must be undertaken if we are to understand these systems in sufficient detail. One set of tools that may prove useful is the formal principles of model building and checking, which could allow the biologist to frame these inherently temporal questions in a sufficiently rigorous framework. In response to these challenges, GOALIE (Gene ontology algorithmic logic and information extractor) was developed and has been successfully employed in the analysis of high-throughput biological data (e.g. time-course gene-expression microarray data and neural spike train recordings). The method has applications to a wide variety of temporal data, indeed any data for which there exist ontological descriptions. This paper describes the algorithms behind GOALIE and its use in the study of the Intraerythrocytic Developmental Cycle (IDC) of Plasmodium falciparum, the parasite responsible for a deadly form of chloroquine-resistant malaria. We focus in particular on the problem of finding phase changes, times of reorganization of transcriptional control.

    Ballistic matter waves with angular momentum: Exact solutions and applications

    An alternative description of quantum scattering processes rests on inhomogeneous terms amended to the Schrödinger equation. We detail the structure of sources that give rise to multipole scattering waves of definite angular momentum, and introduce pointlike multipole sources as their limiting case. Partial wave theory is recovered for freely propagating particles. We obtain novel results for ballistic scattering in an external uniform force field, where we provide analytical solutions for both the scattering waves and the integrated particle flux. Our theory directly applies to p-wave photodetachment in an electric field. Furthermore, illustrating the effects of extended sources, we predict some properties of vortex-bearing atom laser beams outcoupled from a rotating Bose-Einstein condensate under the influence of gravity. Comment: 42 pages, 8 figures, extended version including photodetachment and semiclassical theory

    Paradigm of tunable clustering using binarization of consensus partition matrices (Bi-CoPaM) for gene discovery

    Copyright © 2013 Abu-Jamous et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Clustering analysis has a growing role in the study of co-expressed genes for gene discovery. Conventional binary and fuzzy clustering do not embrace the biological reality that some genes may be irrelevant for a problem and not be assigned to a cluster, while other genes may participate in several biological functions and should simultaneously belong to multiple clusters. Also, these algorithms cannot generate tight clusters that focus on their cores or wide clusters that overlap and contain all possibly relevant genes. In this paper, a new clustering paradigm is proposed. In this paradigm, all three eventualities of a gene being exclusively assigned to a single cluster, being assigned to multiple clusters, and being not assigned to any cluster are possible. These possibilities are realised through the primary novelty of the introduction of tunable binarization techniques. Results from multiple clustering experiments are aggregated to generate one fuzzy consensus partition matrix (CoPaM), which is then binarized to obtain the final binary partitions. This is referred to as Binarization of Consensus Partition Matrices (Bi-CoPaM). The method has been tested with a set of synthetic datasets and a set of five real yeast cell-cycle datasets. The results demonstrate its validity in generating relevant tight, wide, and complementary clusters that can meet requirements of different gene discovery studies. National Institute for Health Research
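
The aggregate-then-binarize idea described above can be sketched roughly as follows. This is a simplified illustration, not the authors' implementation: it assumes the cluster labels of the individual runs have already been aligned, uses plain element-wise averaging to build the consensus matrix, and applies a single value threshold as the tunable binarization. The function names and the toy data are hypothetical.

```python
def consensus_matrix(partitions):
    """Average aligned binary partition matrices (clusters x genes) element-wise."""
    n_runs = len(partitions)
    n_clusters = len(partitions[0])
    n_genes = len(partitions[0][0])
    return [
        [sum(p[k][g] for p in partitions) / n_runs for g in range(n_genes)]
        for k in range(n_clusters)
    ]


def binarize(copam, threshold):
    """Assign gene g to cluster k when its consensus membership reaches the threshold."""
    return [[1 if value >= threshold else 0 for value in row] for row in copam]


# Toy example (made-up): three clustering runs, two clusters, four genes.
runs = [
    [[1, 1, 0, 0], [0, 0, 1, 1]],
    [[1, 0, 0, 0], [0, 1, 1, 1]],
    [[1, 1, 0, 1], [0, 0, 1, 0]],
]
copam = consensus_matrix(runs)
tight = binarize(copam, threshold=1.0)  # only unanimous memberships survive
wide = binarize(copam, threshold=0.3)   # any substantial support counts

print(tight)
print(wide)
```

With the strict threshold, genes 1 and 3 (zero-based) end up in no cluster. With the loose threshold, the same two genes belong to both clusters. These are the unassigned and multiply-assigned cases the abstract describes.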

    The structure of the PapD-PapGII pilin complex reveals an open and flexible P5 pocket

    P pili are hairlike polymeric structures that mediate binding of uropathogenic Escherichia coli to the surface of the kidney via the PapG adhesin at their tips. PapG is composed of two domains: a lectin domain at the tip of the pilus followed by a pilin domain that comprises the initial polymerizing subunit of the 1,000-plus-subunit heteropolymeric pilus fiber. Prior to assembly, periplasmic pilin domains bind to a chaperone, PapD. PapD mediates donor strand complementation, in which a beta strand of PapD temporarily completes the pilin domain's fold, preventing premature, nonproductive interactions with other pilin subunits and facilitating subunit folding. Chaperone-subunit complexes are delivered to the outer membrane usher where donor strand exchange (DSE) replaces PapD's donated beta strand with an amino-terminal extension on the next incoming pilin subunit. This occurs via a zip-in-zip-out mechanism that initiates at a relatively accessible hydrophobic space termed the P5 pocket on the terminally incorporated pilus subunit. Here, we solve the structure of PapD in complex with the pilin domain of isoform II of PapG (PapGIIp). Our data revealed that PapGIIp adopts an immunoglobulin fold with a missing seventh strand, complemented in parallel by the G1 PapD strand, typical of pilin subunits. Comparisons with other chaperone-pilin complexes indicated that the interactive surfaces are highly conserved. Interestingly, the PapGIIp P5 pocket was in an open conformation, which, as molecular dynamics simulations revealed, switches between an open and a closed conformation due to the flexibility of the surrounding loops. Our study reveals the structural details of the DSE mechanism.